Overview

It has been seen from our applications of cellcycleR to single cell level data that the method has a tendency to pick up those genes as influential ones which have very high expression patterns under some cell states. Then the ordering is carried out based on the expression levels of these genes and the method tries to arrange the cells in the order based on the peaks of their expressions at these cell states (clumps these cell states together). As a result of this, the method is unable to detect true sinusoidal patterns in the data.

Simulation Example (moderate SNR)

Consider the following example

library(cellcycleR)
library(wavethresh)
## Loading required package: MASS
## WaveThresh: R wavelet software, release 4.6.6, installed
## 
## Copyright Guy Nason and others 1993-2013
## 
## Note: nlevels has been renamed to nlevelsWT
G <- 100;
num_cells <- 128;
amp_genes <- rep(1, G);
phi_genes <- rep(c(2,5), each=G/2);
sigma_genes <- rchisq(G, 1);
cell_times_sim <- seq(0,2*pi, length.out=num_cells);
cycle_data <- sim_sinusoidal_cycle(G, amp_genes, phi_genes, sigma_genes, cell_times_sim);

celltime_levels <- 128;

Notice from the simulation design so far that we have picked most fenes to have high noise variation compared to signal variation. This is because in real cell cycle data, we do see a lot of noise variation.

Now we assume the first \(10\) genes to be ribosomal genes. We pick 10 cell states (uniformly spaced) where these ribosomal genes have higher expression.

n_states <- 10;
n_ribo <- 10;
cell_states <- sapply(1:n_states, function(s) floor((s*length(cell_times_sim))/n_states));
for(m in 1:n_states){
  for(n in 1:n_ribo){
    cycle_data[cell_states[m],n] <- rnorm(1,8,1);
  }
}

We now observe the graphs of a ribosomal gene and a non-ribosomal gene.

plot(cycle_data[,2], type="l")

plot(cycle_data[,40], type="l")

We now reorder the data.

sample_reorder <- sample(1:num_cells,num_cells, replace=FALSE);
cell_times_reorder <- cell_times_sim[sample_reorder];
cycle_data_reorder <- cycle_data[sample_reorder,];

cellcycleR method- sinusoidal

system.time(out_sinusoidal <- sin_cell_ordering_class(cycle_data_reorder, celltime_levels = celltime_levels, num_iter=500))
## Final loglikelihood, iter 500 : -8487.51456199
##    user  system elapsed 
## 317.627 153.287 168.335
library(plotrix)
## Warning: package 'plotrix' was built under R version 3.2.3
library(RColorBrewer)
radial.plot(lengths=1:length(out_sinusoidal$cell_times),radial.pos=out_sinusoidal$cell_times[order(cell_times_reorder)],
            line.col=colorRampPalette(brewer.pal(9,"Blues"))(length(out_sinusoidal$cell_times)), lwd=2)

radial.plot(lengths=1:length(cell_times_reorder),radial.pos=sort(cell_times_reorder),
            line.col=colorRampPalette(brewer.pal(9,"Blues"))(length(cell_times_reorder)), lwd=2)

The plots of estimated gene pattern and the true gene pattern.

2nd co-ordinate

plot(cycle_data_reorder[order(out_sinusoidal$cell_times),2], type="l")

plot(cycle_data[,2],type="l")

30th coordinate

plot(cycle_data_reorder[order(out_sinusoidal$cell_times),30], type="l")

plot(cycle_data[,30],type="l")

50th coordinate

plot(cycle_data_reorder[order(out_sinusoidal$cell_times),50], type="l")

plot(cycle_data[,50],type="l")

Simulation Example (low SNR)

Consider the following example

library(cellcycleR)
library(wavethresh)

G <- 100;
num_cells <- 128;
amp_genes <- rep(1, G);
phi_genes <- rep(c(2,5), each=G/2);
sigma_genes <- rchisq(G, 3);
cell_times_sim <- seq(0,2*pi, length.out=num_cells);
cycle_data <- sim_sinusoidal_cycle(G, amp_genes, phi_genes, sigma_genes, cell_times_sim);

celltime_levels <- 128;

Notice from the simulation design so far that we have picked most fenes to have high noise variation compared to signal variation. This is because in real cell cycle data, we do see a lot of noise variation.

Now we assume the first \(10\) genes to be ribosomal genes. We pick 10 cell states (uniformly spaced) where these ribosomal genes have higher expression.

n_states <- 10;
n_ribo <- 10;
cell_states <- sapply(1:n_states, function(s) floor((s*length(cell_times_sim))/n_states));
for(m in 1:n_states){
  for(n in 1:n_ribo){
    cycle_data[cell_states[m],n] <- rnorm(1,8,1);
  }
}

We now observe the graphs of a ribosomal gene and a non-ribosomal gene.

plot(cycle_data[,2], type="l")

plot(cycle_data[,40], type="l")

We now reorder the data.

sample_reorder <- sample(1:num_cells,num_cells, replace=FALSE);
cell_times_reorder <- cell_times_sim[sample_reorder];
cycle_data_reorder <- cycle_data[sample_reorder,];

cellcycleR method- sinusoidal

system.time(out_sinusoidal <- sin_cell_ordering_class(cycle_data_reorder, celltime_levels = celltime_levels, num_iter=500))
## Final loglikelihood, iter 500 : -27495.4719848
##    user  system elapsed 
## 326.457 170.612 179.124
library(plotrix)
library(RColorBrewer)
radial.plot(lengths=1:length(out_sinusoidal$cell_times),radial.pos=out_sinusoidal$cell_times[order(cell_times_reorder)],
            line.col=colorRampPalette(brewer.pal(9,"Blues"))(length(out_sinusoidal$cell_times)), lwd=2)

radial.plot(lengths=1:length(cell_times_reorder),radial.pos=sort(cell_times_reorder),
            line.col=colorRampPalette(brewer.pal(9,"Blues"))(length(cell_times_reorder)), lwd=2)

The plots of estimated gene pattern and the true gene pattern.

2nd co-ordinate

plot(cycle_data_reorder[order(out_sinusoidal$cell_times),2], type="l")

plot(cycle_data[,2],type="l")

30th coordinate

plot(cycle_data_reorder[order(out_sinusoidal$cell_times),30], type="l")

plot(cycle_data[,30],type="l")

50th coordinate

plot(cycle_data_reorder[order(out_sinusoidal$cell_times),50], type="l")

plot(cycle_data[,50],type="l")

Conclusion

It seems that cellcyleR is not doing a bad job at extracting the patterns in the data even when the ribosomal genes which result in high expression at certain specific cell states are present.